# Supplementary Materials for 'The Goldilocks of Pragmatic Understanding: Fine-Tuning Strategy Matters for Implicature Resolution by LLMs'
This folder contains the data, the human annotations, the prompt templates, a preview of the raw results, and the code used for the evaluations. Each is briefly described below.
The section "Check installation and code" below provides an end-to-end script that tests whether the code runs properly.

**Quick access**  
- [All results in one big file](https://drive.google.com/file/d/1IfFQe2GYJduMTWz7tdf081ZzvT7CCUAc/view?usp=sharing).
- [Results per model and compute per model](https://drive.google.com/file/d/1e7PHYp-DsvoPuUEMVfy8WGSZ-ZDKtUAY/view?usp=sharing).
- CoT prompts in `data/cot_template_<n>`.
- Human evaluation in `human evaluation`.
- Type labels per example (e.g. generalised or particularised) in `data/type_labels.csv`.
- All commands to run the experiments in `experiment_run_scripts`.
- [HuggingFace dataset](https://huggingface.co/datasets/UCL-DARK/ludwig).

## The data
The test set can be found at `data/test_conversational_implicatures.csv` and the dev set at `data/dev_conversational_implicatures.csv`.
As mentioned in the paper, the data is taken from George & Mamidi, 2020. They published the data under a [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) license.
The data is given in utterance, response, implicature tuples and these can be wrapped in the prompt templates with the code, as detailed below.
The type labels for the error analysis are in `data/type_labels.csv`.
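
As a minimal sketch of how the tuples could be read, the snippet below parses rows with Python's `csv` module. The column names `utterance`, `response`, and `implicature` are an assumption based on the keys shown in the results preview, and `read_examples` is a hypothetical helper that is not part of the repository code. The inline sample stands in for the real CSV file.

```python
import csv
import io


def read_examples(handle):
    """Return a list of (utterance, response, implicature) tuples."""
    reader = csv.DictReader(handle)
    return [(row["utterance"], row["response"], row["implicature"]) for row in reader]


# Inline sample standing in for data/test_conversational_implicatures.csv
# (assumed header; check the actual file before relying on it).
sample = io.StringIO(
    "utterance,response,implicature\n"
    "Is Marci grumpy?,he's as gentle as a lamb,no\n"
)
examples = read_examples(sample)
print(examples[0])
```

To read the real file, an open file handle (for example `open(path, newline="")`) could be passed instead of the `StringIO` object.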

## The human evaluation
As described in Appendix G of the paper, the human evaluation was done by dividing the data into four subsets and having five annotators
annotate each of the 150 examples in a subset (together giving 600 examples, each annotated by 5 unique annotators).
The subset files with annotations can be found in the folder `human evaluation`.

## The prompt templates
The prompt templates used for the zero-shot and few-shot evaluations can be found in `data/prompt_templates.csv`. These
are the six main prompt templates. The additional prompt templates used for the extra zero-shot experiment are in the file
`data/alignment_prompt_templates.csv`. Examples can be wrapped in prompt templates with the code as detailed below.

The prompt templates for the models for which we do not have likelihood access can be found in `data/prompt_templates_completion.csv`
and the ones for the chain-of-thought experiment in `data/prompt_templates_cot_eng.txt`.

The chain-of-thought prompt templates are each presented in printed format in `data/cot_template_<n>`.

## All results
The file `results/preview_results.json` contains a preview of the grouped results, produced by running the head command
on the full results file: `head -50000 all_results.json`. The full results file (`all_results.json`) is too big to add
to the supplementary materials, but can be found on [Google Drive](https://drive.google.com/file/d/1IfFQe2GYJduMTWz7tdf081ZzvT7CCUAc/view?usp=sharing). For now, the preview can be used
to check the predictions and the numbers reported in the paper. For example, the results for column `k = 0` of text-davinci-001 in Table 83, Appendix K.10 of the paper
are at the top of the file:
```json
"openai-text-davinci-001": {
            "mean_accuracy": 72.30555555555556,
            "std": 2.8274721511328975,
            "template_results": {
                "prompt_template_1": 76.5,
                "prompt_template_2": 72.0,
                "prompt_template_3": 74.83333333333333,
                "prompt_template_4": 68.0,
                "prompt_template_5": 72.5,
                "prompt_template_6": 70.0
            },
```
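
The summary fields can be checked against the per-template numbers. The sketch below recomputes them from the values shown above. It assumes that `mean_accuracy` is the plain mean over the six templates and that `std` is the population standard deviation (dividing by N rather than N - 1). That assumption is not stated in the file itself, but it reproduces the reported figures.

```python
from statistics import mean, pstdev

# Per-template accuracies copied from the preview above.
template_results = {
    "prompt_template_1": 76.5,
    "prompt_template_2": 72.0,
    "prompt_template_3": 74.83333333333333,
    "prompt_template_4": 68.0,
    "prompt_template_5": 72.5,
    "prompt_template_6": 70.0,
}

scores = list(template_results.values())
mean_accuracy = mean(scores)
std = pstdev(scores)  # population standard deviation (divides by N)

print(round(mean_accuracy, 4), round(std, 4))
```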
Directly below, the predictions per example per template are shown. For example, the first one:
```json
"prompt_template_1": {
                    "0": {
                        "id": 0,
                        "original_example": {
                            "source": "",
                            "type": "no",
                            "utterance": "Is Marci grumpy?",
                            "response": "he's as gentle as a lamb",
                            "implicature": "no"
                        },
                        "true": "no",
                        "pred": "no",
                        "correct": 1,
                        "prompt_examples": []
                    }
```
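
The `correct` fields of entries like the one above can be aggregated into a per-template accuracy. The sketch below assumes the accuracy is the percentage of entries with `correct` equal to 1. This is consistent with the reported values if the test set has 600 examples (for instance 76.5 = 459 / 600), but it is not taken from the repository code. `template_accuracy` and the second toy entry are hypothetical.

```python
def template_accuracy(predictions):
    """Percentage of entries whose `correct` field equals 1."""
    n_correct = sum(entry["correct"] for entry in predictions.values())
    return 100.0 * n_correct / len(predictions)


# Two toy entries in the shape shown above; the second one is made up
# purely for illustration and does not come from the results file.
toy_predictions = {
    "0": {"id": 0, "true": "no", "pred": "no", "correct": 1},
    "1": {"id": 1, "true": "yes", "pred": "no", "correct": 0},
}

print(template_accuracy(toy_predictions))
```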

The results for the CoT experiment can be found in `results/all_results_cot.json`.
The separate results per model can be found on [Google Drive](https://drive.google.com/file/d/1e7PHYp-DsvoPuUEMVfy8WGSZ-ZDKtUAY/view?usp=sharing) as well.
All results, plots, and tables can be produced from the `all_results.json` file.

## Running evaluations with the code

Before running, go over the "Check installation and code" section below.

The exact evaluations done for the paper cannot easily be reproduced without:
(1) having access to OpenAI or Cohere credits,
(2) having access to enough compute to run the large open-source models.

### Check installation and code

Make sure the directory `code` is your current working directory.

The code was developed with Python 3.9.10, so make a virtual environment with this version.

```bash
pyenv virtualenv 3.9.10 testcode
pyenv activate testcode
```

Rust is a dependency of the `transformers` library. Install the compiler with:

```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

On a Mac with `arm64`, the following requirements for `sentencepiece` also need to be installed:

```bash
brew install cmake
brew install gperftools
brew install pkg-config
```

Python requirements:

```bash
python -m pip install -r requirements.txt
```

To run tests for all the code:

```bash
pytest tests/.
```

To run an end-to-end code check:

Note that you need an OpenAI API key for this, which can be set in `static/openai_api_key.txt`. The cost
of running the test is very low, since we use the `ada` engine, which is the cheapest, and we only evaluate 5 examples (5 x 6 x 2 = 60 API queries).
In the worst case, if each query had 1000 tokens (far more than they actually have), this would cost 60 x 0.0016 ≈ 0.1 dollars.
Alternatively, one can run the test with a new OpenAI account and its free credits, which carries less risk.
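
The worst-case estimate above can be reproduced with a few lines of arithmetic. All numbers are taken from the text. The variable names are illustrative only, and the factor of 2 is used as given, since the text does not say what it counts.

```python
n_examples = 5
n_templates = 6
queries_per_prompt = 2   # factor taken from the text; what it counts is not stated there
cost_per_query = 0.0016  # dollars, assuming the worst case of 1000 tokens per query

n_queries = n_examples * n_templates * queries_per_prompt
worst_case_cost = n_queries * cost_per_query

print(n_queries, round(worst_case_cost, 3))
```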

```bash
chmod a+x test_code_runs.sh
./test_code_runs.sh
```

The expected output is:

```bash
This script should not take more than a minute to run.
PASSED
```

Note that to run any evaluations on OpenAI's or Cohere's models you need to have API keys. Add these keys
to the two files in the folder `static` called `cohere_api_key.txt` and `openai_api_key.txt`. The former just needs
a single line with the key, the latter has the organization key on the first line and the API key on the second.
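
A minimal sketch of how the two key-file layouts described above could be parsed is given below. The helper names `parse_cohere_key` and `parse_openai_keys` are hypothetical and not part of the repository code. They only illustrate the expected file contents, using placeholder values.

```python
def parse_cohere_key(text):
    """Cohere file: a single line holding the API key."""
    return text.strip()


def parse_openai_keys(text):
    """OpenAI file: organization key on line 1, API key on line 2."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) != 2:
        raise ValueError(
            "expected exactly two non-empty lines: organization key, then API key"
        )
    organization, api_key = lines
    return organization, api_key


# Placeholder values only; real keys would be read from the files in `static`.
print(parse_openai_keys("org-PLACEHOLDER\nsk-PLACEHOLDER\n"))
print(parse_cohere_key("cohere-PLACEHOLDER\n"))
```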

### API models: OpenAI and Cohere 

If you have an OpenAI or Cohere key, add the former to `static/openai_api_key.txt` (first line org ID, second line API key)
and the latter to `static/cohere_api_key.txt`. Then the command below can be used
to run evals with OpenAI and Cohere models by setting `model_ids` to the right identifier. For example:

NB: running the following command with a proper API key may cost money!

```bash
python -m src.probe_llm +experiment=particularised ++model_ids=openai-davinci ++objectives=lm
```

and for Cohere:

```bash
python -m src.probe_llm +experiment=particularised ++model_ids=cohere-xl ++objectives=lm
```

All commands to run the experiments from the paper can be found in `experiment_run_scripts`.

### Open-source models

These evaluations are run with [EleutherAI's eval harness](https://github.com/EleutherAI/lm-evaluation-harness), and to use this framework
the dataset needs to be available on Huggingface. The HF dataset identifier is `UCL-DARK/ludwig` (see [here](https://huggingface.co/datasets/UCL-DARK/ludwig)).

It can be loaded as follows:

```python
from datasets import load_dataset

dataset = load_dataset("UCL-DARK/ludwig")
```

EleutherAI's eval harness makes it easy to run evaluations on large models that need to be loaded on multiple GPUs, for example with the following command:

```bash
python main.py --model_api_name 'hf-causal' --model_args pretrained=facebook/opt-2.7b --task_name ludwig/${k}-shot  --template_names 'template_1,template_2,template_3,template_4,template_5,template_6' --device gpu
```
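
The command above leaves `${k}` to be filled in. As a sketch, the snippet below builds the same command string for a given `k` in Python. Only flags that appear in the command above are used. `harness_command` is a hypothetical helper, and the values of `k` in the loop are illustrative, so the task names should be checked against what the harness actually provides.

```python
import shlex


def harness_command(k, model="facebook/opt-2.7b", n_templates=6):
    """Build the eval-harness command shown above for a k-shot task."""
    templates = ",".join(f"template_{i}" for i in range(1, n_templates + 1))
    args = [
        "python", "main.py",
        "--model_api_name", "hf-causal",
        "--model_args", f"pretrained={model}",
        "--task_name", f"ludwig/{k}-shot",
        "--template_names", templates,
        "--device", "gpu",
    ]
    # shlex.join quotes arguments where the shell would need it.
    return shlex.join(args)


for k in (0, 5):  # illustrative values of k only
    print(harness_command(k))
```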

